[57] ABSTRACT

A multiplier which uses Booth recoding to multiply large 
word length operands. A first operand is fully loaded into a 
shift register. The loading of the second operand is then 
begun, with the recoding operation beginning after the 
loading of the minimum number of bits of the second 
operand required for the first stage of the recoding. The 
recoded portions of the second operand are used to select 
what factor of the first operand to use in forming the partial 
product terms. The partial product terms are added using 
carry save addition, with the least significant bits being used 
to form the least significant bits of the final product. The 
most significant bits of the final product are then formed by 
adding the carry save data from the partial product summa- 
tions. The present invention performs squaring operations 
used in exponentiation functions by shifting the first operand 
value (A) by one bit to form twice that value (2*A) prior to
multiplying by the second operand (B) to form the 2*(A*B) 
term needed in such calculations. This shifting is performed 
in the multiplexer used to select the appropriate factor of the 
first operand for each partial product term, rather than after 
the accumulation of the final product term. 

4 Claims, 6 Drawing Sheets

BOOTH MULTIPLIER WITH SQUARING
OPERATION ACCELERATOR 

TECHNICAL FIELD 

The present invention relates to architectures for large 
operand length multipliers, and more specifically, to an 
apparatus which calculates the square of the sum of two 
operands using the Booth multiplication algorithm in a faster 
manner than currently used multipliers. 

BACKGROUND OF THE INVENTION 

Many data processing applications require that two oper- 
ands be multiplied together. In particular, signal processing 
and data encryption applications depend on high speed 
multiplication operations, often with large word length oper- 
ands. 

The product of two operands is typically obtained through 
successive additions of shifted strings of bits, with each 
string representing an intermediate or partial product of one 
operand with a term from the other operand. The interme- 
diate product terms are summed to obtain the final result. 
The product (P) of two operands (X and Y) can be repre- 
sented as: 

$$P = X \cdot Y = X \times \sum_{i=0}^{n-1} y_i r^i = \sum_{i=0}^{n-1} X \times y_i r^i \qquad (1)$$

where y_i is the value of the ith bit of the Y operand, r is the
radix for the number system representation used, and the 
summation runs from i=0 to n-1, with n being the number 
of bits in the Y operand. 

Equation (1) indicates that the multiplication operation is
equivalent to the summing of n terms of the partial product
(X × y_i r^i). For a binary number representation system, the
radix equals 2 and y_i equals either 0 or 1. The ith term in the
sum is then obtained by a left shift of operand X for i bit
positions and multiplication by the digit y_i. The n terms are
then summed.
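
The summation of Equation (1) for the binary case can be sketched as follows. This is an illustrative software model only, not a description of any circuit in this document; the function name `shift_add_multiply` and its parameters are assumptions introduced for the example.

```python
def shift_add_multiply(x: int, y: int, n: int) -> int:
    """Sum the n partial products X * y_i * 2**i of an n-bit unsigned Y."""
    product = 0
    for i in range(n):
        y_i = (y >> i) & 1           # value of the ith bit of Y
        product += (x << i) * y_i    # X shifted left i positions, times y_i
    return product


print(shift_add_multiply(13, 11, 4))  # 143
```

Each loop iteration corresponds to one term of the sum, so an n-bit multiplier costs n additions in this simple form.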

Booth Recoding is a well known method for multiplying 
unsigned or two's complement numbers. The method is 
based on the observations that a string of zeros in an operand 
requires no addition of the partial product terms, just a 
shifting of the previous partial product, and that a string of 
ones in the multiplier extending from bit 2^p to 2^q (q>p) can
instead be treated as the value 2^(q+1) - 2^p. These observations
have led to the development of a faster method for perform- 
ing multiplication operations. 

Booth's method is carried out by the following steps. Let
x_i be the ith bit of an n-bit multiplier X. Bit x_(n-1) is the most
significant bit and x_0 is the least significant bit. A bit x_(-1) = 0
is assumed in order to provide closure of the method. The
multiplicand is Y. Starting with i=0, bits x_i and x_(i-1) of the
multiplier are compared. Based on the comparison, the
indicated action is performed:



x_i    x_(i-1)    Action

0      0          Shift Y left with respect to partial product
0      1          Add Y to partial product, then shift Y
1      0          Subtract Y from partial product, then shift Y
1      1          Shift Y



This process is repeated until n comparisons are completed. 
The result is the product of the two operands. 
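
The bit-pair comparison described above can be sketched in software as follows. This is a minimal model under stated assumptions: the multiplier X is treated as an n-bit two's complement value, the "shift Y" step is modeled by weighting Y with 2**i, and the function name `booth_multiply` is hypothetical.

```python
def booth_multiply(x: int, y: int, n: int) -> int:
    """Multiply Y by the n-bit two's complement multiplier X, Booth style."""
    x &= (1 << n) - 1        # keep the n-bit two's complement pattern of X
    product = 0
    prev = 0                 # x_(-1) = 0 provides closure of the method
    for i in range(n):
        cur = (x >> i) & 1
        if (cur, prev) == (0, 1):
            product += y << i    # add Y at the current shift position
        elif (cur, prev) == (1, 0):
            product -= y << i    # subtract Y at the current shift position
        # (0, 0) and (1, 1): no addition, only the shift
        prev = cur
    return product


print(booth_multiply(-3, 7, 4))  # -21
```

A run of ones in X triggers only one subtraction at its start and one addition just past its end, which is the saving the method relies on.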

The above description of Booth's method is based on 
comparing two bits of one of the operands at a time. If a
higher radix value is used, extensions of the method can be
made to comparisons of three or more bits. This will further
increase the speed with which the multiplication operation is
implemented. For example, given two operands expressed as
base 4 (modulo 4) numbers, then if three bits of the
multiplier X are examined during each comparison, the 
multiplicand terms to be added or subtracted are 0, Y, -Y, 
2Y, and -2Y. The table below shows the appropriate factor 
to add based on a comparison between bits i+1, i, and i-1 of
the multiplier operand X:

Current Pair        Previous Bit
i + 1     i         i - 1            Factor

0         0         0                0
0         0         1                +Y
0         1         0                +Y
0         1         1                +2Y
1         0         0                -2Y
1         0         1                -Y
1         1         0                -Y
1         1         1                0
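
The three-bit comparison can be sketched in software as follows. The lookup table below is the standard radix 4 Booth recoding, which uses the same factors named in the text (0, Y, -Y, 2Y, and -2Y); the names `RADIX4_FACTOR` and `booth_radix4_multiply` are hypothetical, and the sketch assumes an even bit count n and a two's complement multiplier.

```python
# Factor of Y selected by multiplier bits (i+1, i, i-1); standard radix 4 table.
RADIX4_FACTOR = {
    (0, 0, 0): 0, (0, 0, 1): 1, (0, 1, 0): 1, (0, 1, 1): 2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4_multiply(x: int, y: int, n: int) -> int:
    """Multiply Y by the n-bit two's complement X, two multiplier bits per step."""
    if n % 2:
        raise ValueError("this sketch assumes an even number of multiplier bits")
    padded = (x & ((1 << n) - 1)) << 1   # append x_(-1) = 0 below bit 0
    product = 0
    for i in range(0, n, 2):
        window = (padded >> i) & 0b111   # bits i+1, i, i-1 of X
        key = ((window >> 2) & 1, (window >> 1) & 1, window & 1)
        product += RADIX4_FACTOR[key] * (y << i)
    return product


print(booth_radix4_multiply(7, 9, 4))  # 63
```

Only n/2 partial products are formed, compared with n for the two-bit comparison, which is the speed gain the higher radix provides.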



FIG. 1 is a block diagram of a prior art circuit for a
multiplier 10 which uses Booth's recoding method to multiply
two operands. The multiplication operation executed
by multiplier 10 can be described in terms of three processing
stages. During the first stage, data representing operands
A and B is loaded. During the second stage, operand B is
shifted in groups of bits (where each group contains 4
different bits in the case of a modulo 4 recoder) into a Booth
recoder, the operand is recoded, and the resultant partial
product terms are formed and accumulated. The accumulation
phase produces partial sum and carry save data for the
sums of the partial products. This stage produces 4 bits of the
final product per clock cycle by using a 4 bit carry look
ahead adder to combine the least significant bits of the
partial products. The final product data is stored in a 512 bit
accumulator. The stage continues until all of operand B has
been recoded (256 bits in the case of this example), with the
256 bits of final product data generated forming the 256 least
significant bits of the final result. In the final stage, the final
partial sum and carry save data is added together to produce
the 256 most significant bits of the final result. The circuit
elements used to implement each of the three stages will
now be described.

The data representing operands A and B is input by means 
of 32 bit data bus 12. The multiplicand operand A data is 
retrieved from bus 12 and loaded into 256 bit shift register 
14, in 32 bit groups, one group with each clock cycle, where
clock signal (CLKS) 15 controls the loading of the 32 bit
data groups. As operand A is 256 bits in size in this example, 
8 clock cycles are required to complete loading it into 
register 14. 

Operand A multiplexer 13 is used to control the loading of
data into register 14, and in particular, to maintain the
register in an idle state after the operand A data has been
loaded and the other operations of the multiplier are being
executed. Multiplexer 13 has two inputs: a first input signal
which instructs the multiplexer to load operand A data,
shifting 32 bit wide groups of operand A data into register
14; and a second input signal which instructs the register not
to shift the data being loaded. The no-shift control signal is
used during the clock cycles after operand A has been fully
loaded in order to maintain the entire operand A data in the
register. This capability is needed because clock signal 15 is
continuously provided to register 14, which causes the
contents of the register to be shifted out with each clock
cycle. Therefore, multiplexer 13 is used to provide an idle
state so that the data flow into register 14 is properly
coordinated with the multiplication stages. In this case,
multiplexer 13 and a feedback loop are used to maintain the
full 256 bit operand A data in the register for use with the
Booth recoding process while clock signal 15 is clocking the
register.

Operand A multiplexer 13 decodes the load operand A
data, the shift 32 bit wide data groups, and no shift input
signals so that the 32 bit shifted data groups of the 256 bit
input or the non-shifted 256 bit input to the multiplexer are
connected to the multiplexer output. The data shifting
function is obtained in a known manner by means of the
connections between the multiplexer and register 14. The
control signals for selecting which function is implemented
by multiplexer 13 are provided by an external sequencer or
state machine (not shown) in accordance with the phase of
the multiplication operation being executed.

After all of the operand A data has been loaded, multiplier
operand B is then loaded in 32 bit groups into 256 bit shift
register 16, where register 16 is controlled by clock signal
CLKS 15. Operand B multiplexer 17 is used to control the
functioning of continuously clocked register 16 in accordance
with the stage of the multiplication operation being
carried out. Multiplexer 17 has three inputs: a first input
signal which instructs the multiplexer to load the operand B
data, shifting 32 bit wide groups of the data into register 16;
a second input signal which instructs the register not to shift
the data and which is used to produce an idle state; and a
third input signal which instructs the register to shift the
operand B data out of the register in groups of 4 bits. As in
the case of multiplexer 13, the control signals for selecting
which function is implemented by multiplexer 17 are provided
by an external sequencer or state machine in accordance
with the phase of the multiplication operation being
executed. As operand B is 256 bits in size in this example,
8 clock cycles are required to complete loading it into
register 16. Thus, in this example, a total of 16 clock cycles
are required to load operands A and B into their respective
registers. Furthermore, because of the design of this
multiplier, the operands must be fully loaded before the
Booth recoding process can begin.

The operand B data is shifted out of register 16 in 4 bit
groups because application of Booth's method using a two
stage modulo 4 recoder (as in the present example) requires
4 bits of operand B for each recoding operation. The 4 bit
groups of operand B data are transferred to Booth Recode
Decoder module 18 by means of data bus 19. Booth Recode
module 18 evaluates multiplier operand B in successive bit
fields to determine what factor of multiplicand operand A to
use in forming the partial product terms which are added
together to obtain the final product. Since Booth module 18
is a two stage recoder, 2 successive bit fields are recoded
during each clock cycle. Each bit field recoding produces
two least significant bits of an uncorrected result for the final
product and a modulo 4 carry term. Booth module 18 thus
produces 4 least significant bits of uncorrected final product
data and 2 modulo 4 carry bits per clock cycle. As operand
B is 256 bits long in this example, it takes approximately 64
clock cycles (256 bits/4 recoded bits per cycle) to recode the
entire operand.
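
The cycle counts quoted for the prior art example can be reproduced with the short calculation below. It is only an arithmetic sketch; the function name `prior_art_cycles` and its parameter names are assumptions made for the illustration.

```python
def prior_art_cycles(operand_bits: int = 256, bus_bits: int = 32,
                     recoded_bits_per_cycle: int = 4):
    """Return (load cycles for A and B, recode cycles) for the FIG. 1 example."""
    load_a = operand_bits // bus_bits     # 8 cycles to fill register 14
    load_b = operand_bits // bus_bits     # 8 cycles to fill register 16
    recode = operand_bits // recoded_bits_per_cycle
    return load_a + load_b, recode


print(prior_art_cycles())  # (16, 64)
```

With these numbers the load phase is a quarter of the length of the recoding phase, and none of it overlaps with recoding in the FIG. 1 design.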

The result of the recoding operation is a control signal 
which instructs Booth module 18 to select the appropriate 
factor of operand A (0, A, -A, 2A, or -2A) to use in forming 
the partial product terms. Since two recode stages are used
in Booth recoder 18 of this example, Booth recoder 18 
outputs two factors of operand A each clock cycle. 



One factor of operand A serves as an input to Partial 
Sum/Carry Save (PS/CS) Adder Array 0 20, while the 
second factor of operand A serves as an input to Partial 
Sum/Carry Save (PS/CS) Adder Array 1 22. Thus, as each 
group of 4 different bits of operand B is recoded during a 
clock cycle, two factors of operand A are selected and 
transferred to adders 20 and 22. 

Each of the two PS/CS adders 20 and 22 produces a 260 
bit partial sum and a 260 bit partial carry term. As each of 
the 260 bit wide partial product terms (the factors of operand 
A) are provided to adders 20 and 22, they are added to the 
results of the previous addition operation performed by the 
adders. This results in a new partial sum term and a new 
carry save term. The two least significant bits of the partial 
sum term and the least significant bit of the carry save for 
each addition operation are provided to 4 bit full look ahead 
carry adder 24. As both adders 20 and 22 are generating 
partial sum and carry save terms during each clock cycle, 
two sets of least significant partial sum and carry save bits 
are provided to adder 24, for a total of 4 least significant bits 
of partial sum data and two bits of carry save data. This data 
is combined in adder 24 with the modulo 4 carry bit 
generated by each recoder stage of Booth recoder 18. 

Adder 24 adds the 4 least significant bits of the partial 
sums produced by adders 20 and 22 during a clock cycle to 
the 2 carry save bits and the 2 bits of modulo 4 carry data 
provided by Booth recoder 18. This produces 4 bits of the 
final product term. Each 4 bit group of final product data 
produced by adder 24 is shifted into multiplexer 26 which 
loads 512 bit accumulator 28. 
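
The idea behind carry save accumulation can be sketched as follows: each step reduces three numbers to a partial sum word and a carry word without propagating carries, and a single carry propagating addition is deferred to the end. This is a generic software model for non-negative integers, not a model of adders 20, 22, or 24; the names `carry_save_add` and `accumulate` are hypothetical.

```python
def carry_save_add(a: int, b: int, c: int):
    """Reduce three operands to (partial_sum, carry) with no carry propagation."""
    partial_sum = a ^ b ^ c                       # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # majority bits, weighted by 2
    return partial_sum, carry


def accumulate(terms):
    """Accumulate terms in carry save form, then do one final ordinary addition."""
    ps, cs = 0, 0
    for term in terms:
        ps, cs = carry_save_add(ps, cs, term)
    return ps + cs    # the only carry propagating addition


print(carry_save_add(5, 9, 14))    # (2, 26)
print(accumulate([5, 9, 14, 3]))   # 31
```

The invariant is that the partial sum plus the carry word always equals the sum of the terms seen so far, which is why the final addition of the two registers yields the product.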

Multiplexer 26 has four different control signals as inputs: 
a signal which instructs accumulator 28 to shift the data 
input by 4 bits; a signal which instructs accumulator 28 to 
shift the data input by 32 bits; a signal which instructs 
accumulator 28 not to shift the data; and a signal which 
instructs accumulator 28 to shift the data by 1 bit. As adder 
24 produces 4 bit groups of the final product, multiplexer 26 
controls the loading of accumulator 28 with the data by 
shifting the data by 4 bit increments. When operand B is 
completely recoded and the partial products accumulated, 
the lower 256 bits of 512 bit accumulator 28 will be filled. 
The shift data by 32 bits function is used to dump the 
accumulator data to data bus 40. As discussed previously, the 
no shift function is used to implement an idle state in which 
the data is continually clocked back into accumulator 28. 
This function is needed because the accumulator registers 
are continuously clocked and the accumulator function is not 
utilized during all stages of the multiplication operation. The 
shift data by 1 bit function is used to provide a term of the 
form 2*(A*B) for use in computing the terms in the square 
of the sum of two operands. 

After all of operand B has been recoded, the appropriate 
factors of operand A have been added in adders 20 and 22, 
and the partial sum and carry save data for each cycle has 
been transferred to adder 24, registers 30 and 32 contain the 
most significant bits of the carry save operations performed 
on the operand A factors. CS register 30 is 260 bits in size 
and is clocked by clock signal 15. PS register 32 is 260 bits 
in size and is similarly clocked by clock signal 15. The 
contents of CS register 30 and PS register 32 are used to 
implement the final addition operation which produces the 
upper 256 bits of the final product. CS shift register 30 and
PS shift register 32 are loaded under the control of multi- 
plexers 34 and 36, respectively. 

The final addition stage is performed using the same
adders as were used to produce the lower 256 bits of the final
product. The contents of registers 30 and 32 are fed back into
adder 20 by means of data busses 33 and 35, with adder 20
transferring data to adder 22 by means of data bus 37. As
operand B has been completely recoded, operand B register
16 contains all zeros. Thus the adders are performing an
operation equivalent to (A*0+CS+PS). After adders 20 and
22 are loaded with the contents of registers 30 and 32, the
multiplier unit is cycled through the 64 cycles normally
required to accumulate the partial products. However,
because in this situation operand B is zero, the effect of the
cycling is to add the contents of registers 30 and 32.

The result is that during each cycle, the 2 least significant
bits from each of adders 20 and 22 are added together in 4
bit adder 24 to produce a 4 bit group of the most significant
bits of the final product. Each 4 bit group of the most
significant bits of the final product is loaded into 512 bit
accumulator 28 using the 4 bit shift instruction of multiplexer
26. After accumulator 28 is loaded with the 256 most
significant bits of the final product term, the multiplication
operation is complete. The data is clocked out of accumulator
28 in 32 bit groups and placed on data bus 40.

In the multiplier of FIG. 1, operands A and B must be
completely loaded into registers 14 and 16 before the Booth
recoding operations are commenced. Given a data bus of
width d which can transfer d bits per clock cycle, if the
operands are m bits long, then this design requires 2m/d
clock cycles to transfer the operands into the registers. This
means that 16 clock cycles are required to load two 256 bit
operands into their respective registers, assuming the operands
are loaded 32 bits at a time. This delays the start of the
operand processing until the completion of the 16 clock
cycles.

The multiplier design of FIG. 1 is typical in that it uses
carry-save addition and registering to minimize circuitry and
increase the multiplication rate. High speed multiplication
and exponentiation operations require large Booth adder
arrays having large partial sum and partial carry registers.
Multiplying two m bit operands using a radix 4 Booth
recoding multiplier requires approximately m/(2n) clock
cycles to generate the least significant half of the final
product, where n is the number of Booth recoder adder
stages. The number of Booth recoder adder stages is equal
to the number of bit groups which are recoded during a
single clock cycle. After these m/(2n) cycles, the most
significant upper half of the product is obtained by summing
the contents of the partial sum and partial carry registers. As
noted, this final addition is typically executed using the same
Booth adders as were used to accumulate the partial products
and carry terms in the previous stages of the multiplication
operation.

An important aspect of the multiplier design of FIG. 1
relates to the manner in which it performs exponentiation
operations which are often used in encryption applications.
It is well known that exponentiation operations can be
accelerated by performing squaring operations. Thus, in
some cases it is desirable to efficiently calculate the terms in
the expression for the square of the sum of two operands.
The multiplier of FIG. 1 typically performs a squaring
operation of the sum of operands A and B (where [A+B]^2 =
A^2+2AB+B^2) by adding the product term A*B twice to the
accumulator. Thus, this type of multiplier calculates the
intermediate term in the form (A*B)+(A*B). This approach
uses an extra addition operation to replace the second
multiplication operation that would otherwise be required,
and provides an increase in the speed with which the
calculation can be performed. Another method of calculating
the 2AB term is to form the A*B product term and then shift
the term by one bit in accumulator 28 to form the 2*(A*B)
term. This is even faster than performing the extra addition.
However, this method has the disadvantage that the circuitry
used for performing the shift must be capable of handling a
512 bit shift, and hence consumes a large amount of die area
and is expensive to implement.

Another feature of the multiplier of FIG. 1 is that a single
clock signal is used to control the shifting of data into shift
registers 14, 16, 30, and 32, and accumulator 28. Thus, all
data loading and processing functions for the multiplication
operation are continuously clocked by a common clock
signal, with multiplexers used to produce an idle state so as
to maintain the status of the registers after the data has been
loaded. As this design uses synchronously clocked circuitry,
power consumption is dependent upon the clock frequency.
Since a high clock frequency is desirable for fast processing
operations, this feature results in a high level of power
consumption.

What is desired is a multiplier capable of calculating the
square of the sum of two operands using the Booth recoding
method which is implemented in a faster and more efficient
architecture than currently used multipliers. These and other
advantages of the present invention will be apparent to those
skilled in the art upon a reading of the Detailed Description
of the Invention together with the drawings.

SUMMARY OF THE INVENTION

The present invention is directed to a multiplier which
uses Booth recoding techniques to multiply large word
length operands. The architecture of the multiplier is such
that it implements the multiplication operation in a faster
and more efficient manner than typical architectures used for
the same purpose.

A first operand is fully loaded into a shift register. The
loading of the second operand is then begun, with the
recoding operation beginning after the loading of the minimum
number of bits of the second operand which are
required for the first stage of the recoding. The loading of the
second operand continues while the previously loaded portions
of the operand are recoded and the partial products
based on those recoded portions are generated and summed.
The recoded portions of the second operand are used to
select the factor of the first operand to use in forming the
partial product terms. The partial product terms are added
using carry save addition, with the least significant bits being
used to form the least significant bits of the final product.
The most significant bits of the final product are then formed
by adding the partial sum and carry save data from the
partial product summations.

The multiplier performs squaring operations used in
exponentiation functions by shifting the first operand value (A)
by one bit to form twice that value (2*A) prior to multiplying
by the second operand (B) to form the 2*(A*B) term
needed in such calculations. This shifting is performed in the
multiplexer used to select the appropriate factor of the first
operand for each partial product term, rather than after the
accumulation of the final product term. This reduces the cost
of implementing a squaring operation when compared to
methods in which the shifting is done in the final accumulator.
The multiplier also contains an accumulator having a
shifting function which is used to form the square of the sum
of the first and second operands from the accumulated Booth
partial products.
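
The arithmetic behind shifting the operand rather than the product can be checked with the sketch below. It only demonstrates that the two orderings give the same 2*(A*B) value and compares the bit widths that each shift has to handle; the function names are hypothetical, and nothing here models the select multiplexer itself.

```python
def two_ab_shift_operand(a: int, b: int) -> int:
    """Shift operand A by one bit first, then multiply by B."""
    return (a << 1) * b


def two_ab_shift_product(a: int, b: int) -> int:
    """Multiply first, then shift the full-width product by one bit."""
    return (a * b) << 1


def square_of_sum(a: int, b: int) -> int:
    """(A + B)**2 assembled from A*A, 2*(A*B), and B*B."""
    return a * a + two_ab_shift_operand(a, b) + b * b


a, b = (1 << 256) - 1, (1 << 256) - 5           # two 256 bit operands
assert two_ab_shift_operand(a, b) == two_ab_shift_product(a, b)
print(a.bit_length(), (a * b).bit_length())     # 256 512
print(square_of_sum(12, 30))                    # 1764
```

The printed widths show why the ordering matters in hardware terms: the operand being shifted is 256 bits wide, while the product being shifted is 512 bits wide.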

The clock signals used to control the data processing 
operations and flow of data through the registers and adders 
are gated so that those registers which are needed for the
stage of the multiplication operation being executed are 
clocked, while the other registers are not enabled. This 
reduces the power consumed during the multiplication 
operation when compared to an architecture in which a 
common clock signal is used to synchronously clock the
circuitry. The result is a multiplier design which is faster, 
conserves power, and requires less circuitry than present 
multipliers based on the Booth recoding method. 

Further objects and advantages of the present invention
will become apparent from the following detailed descrip- 
tion and accompanying drawings. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a block diagram of a prior art circuit for a 
multiplier which uses Booth's recoding method to perform 
the multiplication of two operands. 

FIG. 2 is a block diagram of the 256 bit by 256 bit Booth 
Multiplier of the present invention.

FIG. 3 is a block diagram of the multiplier unit of the 256 
bit by 256 bit Booth Multiplier of the present invention. 

FIG. 4 is a schematic of the circuit of one of the two 
cascaded recoders contained in the Booth Recoder module
of the multiplier unit of FIG. 3.

FIG. 5 is a diagram showing the connections between the 
adder arrays and the partial sum/carry save registers of the 
multiplication unit of the present invention. 

FIG. 6 is a more detailed block diagram of the multiplier 
unit of the 256 bit by 256 bit Booth Multiplier of the present 
invention. 

DETAILED DESCRIPTION OF THE
INVENTION 

FIG. 2 is a block diagram of the 256 bit by 256 bit Booth 
Multiplier 50 of the present invention. Multiplier 50 
includes state controller or sequencer 70 which receives 
input commands from processor 60 instructing multiplier 50
to execute one of several basic multiplication functions.
State controller 70 outputs control signals used to generate
clock signals which clock the various components of multiplier
unit 100. The clock signals are generated in a manner
which implements the clock gating features of the present 
invention. 

Upon receipt of an input command, state controller 70
produces signals to enable the various data processing
functions which are carried out in executing the desired
multiplication function. This is accomplished by using a
sequencer which counts system clock cycles and outputs
function enable signals at the appropriate times, in accordance
with the number of cycles required for each stage of
the data processing executed by multiplier unit 100. The
function enable signals are provided to a set of clock gating
control circuits 300. Control circuits 300 output function
clock signals which are used to clock a register or other
component of multiplier 100 which performs a particular
stage of the multiplication operation.

As noted, the combination of the function enable signals 
produced by sequencer 70 and the actions of control circuits 
300 are used to provide clocking signals for the various 
components of multiplier unit 100. By turning the clock 
signals on and off in accordance with the stages of the 
multiplication operation, power can be conserved when 
compared to synchronously clocked architectures. 

A listing of pseudo code describing the operation of state 
controller or sequencer 70 is attached to this application as 
an appendix. The pseudo code indicates the various function 
enable (and disable) signals produced by state controller 70 
in terms of the number of system clock cycles and the stage 
of operation of the multiplication process. 

FIG. 3 is a block diagram of the multiplier unit 100 of the 
256 bit by 256 bit Booth Multiplier 50 of the present
invention. The data representing operands A and B is input 
by means of 32 bit wide data bus 102. The multiplicand operand
A data is retrieved from bus 102 and loaded into 256 bit shift 
register 104, with 32 bits being loaded with each clock cycle. 
Clock signal A (CLKA) 105 controls the loading of the 32
bit data groups for operand A into register 104.

As operand A is 256 bits in size, 8 clock cycles are 
required to complete its loading into register 104. Multiplier 
operand B could then be similarly loaded into a 256 bit shift 
register in the next 8 cycles as occurs in a typical multiplier 
design. However, application of Booth's method using a two 
stage recoder (as in the present invention) requires only the 
first 4 bits of the multiplier operand in order to begin the 
recoding operation. Thus, instead of waiting for operand B 
to be completely loaded into a 256 bit shift register, the first 
32 bits of operand B are loaded into 32 shift 4 register 108, 
which is controlled by clock signal B0 (CLKB0) 109. These 
32 bits are shifted out to the recoder in groups of 4 bits over 
the next 8 cycles of clock 109. This allows eight multipli- 
cation cycles to occur while the remaining 224 bits of 
operand B are loaded into 224 bit shift register 106, with 
clock signal B (CLKB) 107 controlling the loading of the 32 
bit data groups of operand B into register 106. 

By the time the remaining bits of operand B have been 
loaded into register 106, register 108 has finished shifting its 
original 32 bit group out to the recoder in groups of 4 bits. 
This allows the next 32 bit group of operand B to be loaded 
from register 106 into register 108 in accordance with clock 
signal B 107. The clock cycling continues as register 108 
shifts the new B operand data out to the recoder in 4 bit 
groups upon receipt of each clock signal B0 109. This 
continues until register 108 is empty and the next 32 bit 
group is loaded from register 106 upon receipt of clock 
signal 107. This sequence repeats until all of the 224 bits 
loaded into register 106 have been shifted into register 108 
and acted upon by the Booth recoder in the manner to be 
described. 
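
The flow of operand B through the small and large registers can be sketched as a simple software model. The sketch makes several assumptions that the text does not state: the least significant 32 bit word of B arrives first, groups leave the small register least significant bits first, and the large register is modeled as a queue that is already filled when it is needed. The function name `stream_operand_b` is hypothetical.

```python
def stream_operand_b(b: int, total_bits: int = 256, word_bits: int = 32,
                     group_bits: int = 4):
    """Yield group_bits-wide groups of B in the order a recoder would see them."""
    word_mask = (1 << word_bits) - 1
    words = [(b >> (word_bits * k)) & word_mask
             for k in range(total_bits // word_bits)]
    small_reg = words[0]     # first 32 bits go to the small register (cf. 108)
    large_reg = words[1:]    # remaining 224 bits (cf. register 106)
    while True:
        for _ in range(word_bits // group_bits):   # 8 shifts of 4 bits each
            yield small_reg & ((1 << group_bits) - 1)
            small_reg >>= group_bits
        if not large_reg:
            break
        small_reg = large_reg.pop(0)   # reload the next 32 bit group


b_value = (0xDEADBEEF << 200) | 0x0123456789ABCDEF
groups = list(stream_operand_b(b_value))
print(len(groups))  # 64
```

In this model the first group is available as soon as one 32 bit word has been loaded, rather than after all 256 bits, which is the point of splitting the operand B storage into two registers.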

As noted, three clock signals, 105, 107, and 109 are used 
to control the loading of registers 104, 106, and 108. 
However, it is not necessary that all three clock signals be 
enabled and actively clocking the circuitry at the same time.
Clock A signal 105 is needed for 8 cycles to complete the
loading of register 104 with the bits of the A operand. During
this time, neither clock B signal 107 nor clock B0 signal 109
needs to be actively clocking its respective shift
register. Upon completion of the loading of the operand A
bits, clock A signal 105 is not needed until the next
multiplication operation when new operand A data will be loaded,
and hence may be disabled. After loading of the 256 bit
operand into register 104, clock B0 signal 109 is used to load
the first 32 bits of operand B into register 108. This signal
is then used to shift the 32 bits by 4 bits during each
subsequent cycle of clock B0 signal 109. During the time
clock B0 signal 109 is being used to shift the 32 bits loaded
into register 108 out to the recoder in groups of 4 bits, clock
B signal 107 is used to load the remaining 224 bits of
operand B into register 106. Clock B signal 107 can then be
turned off to the register. Thus, clock A signal 105 and clock
B signal 107 can be turned off to their respective registers
when those registers are not being loaded.

The multiplier of FIG. 3 uses control signals to determine
when a clock signal should be active and clocking a register.
The clock control signals are generated as needed depending
upon the status of the data being processed by the multiplier.
By using multiple clocks whose signals are gated and used
as needed, the power consumed by multiplier 100 can be
reduced compared to multipliers which synchronously clock
all of the circuitry using a common clock.

After the first 32 bits of multiplier operand B have been
loaded into B0 register 108, the Booth partial product
accumulation stage of the multiplication operation begins.
As noted, during each cycle of clock B0 signal 109, 4 bits
of the contents of register 108 are shifted out of that register
to Booth Recoder module 110 by means of data bus 111.
Booth Recoder module 110 evaluates multiplier operand B
in successive bit fields to determine what factor of
multiplicand operand A to use in forming the partial product terms
which are added together to obtain the final product. Each bit
field recoding produces two least significant bits of an
uncorrected result for the final product and a modulo 4 carry
term. The bit field evaluation is recoded according to the
Booth method to determine whether a factor of either 0, A,
-A, 2A, or -2A is used in the current partial product term.

Recoder module 110 consists of two three-bit Booth
recoders cascaded together to form a modulo 4 Booth
recoder. Each of the separate recoders examines three
successive bits of multiplier operand B, with the 3 bit fields
overlapping by one bit. Thus, recoder module 110 examines
5 different bits of operand B during each cycle. As noted,
each of the separate recoders produces 2 least significant bits
of uncorrected product data and one bit of modulo 4 carry
data per clock cycle, so that the two cascaded recoders
together produce 4 least significant bits of product data and
2 carry bits per clock cycle.

FIG. 4 is a schematic of the circuit of one of the two
cascaded recoders 200 contained in Booth Recoder module
110 of multiplier unit 100. As indicated in the figure, recoder
200 has three inputs, labelled Yin<0>, Yin<1>, and Yin<2>
202. In accordance with the Booth method, the values of
input bits 202 determine the output of recoder 200. This
output is in the form of a control signal 112 (see FIG. 3)
which instructs select multiplexer 114 to provide the factor
of operand A used to form the partial product. Select
multiplexer 114 responds to control signal 112 by producing
the factor of operand A (obtained from register 104) required
for the partial product term. These control signals 112 are
shown individually in FIG. 4: signal 204 is used to add a
factor of 0 to the partial product; signal 205 is used to add
a factor of A; signal 206 is used to add a factor of 2A; signal
207 is used to add a factor of -A; and signal 208 is used to
add a factor of -2A.

Recoder circuit 200 of FIG. 4 implements the following
truth table based on a comparison of bits 2j+2, 2j+1, and 2j:

[Truth table not reproduced in this text.]

In the above table, the index j runs from 0 to 1, meaning that
during each clock cycle, the three bit groups of bits 0, 1, 2
and bits 2, 3, 4 are recoded by the cascaded recoders. It is
noted that FIG. 4 depicts one example of a suitable circuit
for recoder 200, and that other suitable circuits may
be used to implement the above truth table without departing
from the spirit of the invention.

As noted, the output of Booth Recoder 110 is a control
signal 112 which instructs select multiplexer 114 to use the
appropriate factor of operand A to form the partial product.
Since two recoders 200 are used in Booth Recoder 110,
select multiplexer 114 outputs two factors of operand A each
clock cycle. Recode bits 0, 1, and 2 are used to generate the
appropriate factor of A which serves as an input to Partial
Sum/Carry Save (PS/CS) Adder Array 0 116 and which is
transferred by means of data bus 115. Recode bits 2, 3, and
4 are used to generate the appropriate factor of A which
serves as an input to Partial Sum/Carry Save (PS/CS) Adder
Array 1 118 and which is transferred by means of data bus
117. Thus, as each group of 4 bits of operand B is recoded
during a clock cycle, two factors of operand A are selected
and transferred to adders 116 and 118.

Each of the two PS/CS adders 116 and 118 is a group of
260 one bit carry-save adders. This means that the carries of
each adder are not immediately propagated to the higher
sum bits to produce a single sum. Instead, the adders
produce a 260 bit partial sum and a 260 bit partial carry. As
each of the 260 bit wide partial product terms (the factors of
operand A) is provided to adders 116 and 118, it is
added to the results of the previous addition operation
performed by the adders. The adders are connected in such
a way that the new factors are appropriately shifted by 2 bits
prior to their accumulation with the previous results. This is
done in order to account for the fact that the input data is in
modulo 4 format.

Each add operation results in a new partial sum term and
a new carry save term. The two least significant bits of the
partial sum term and the least significant bit of the carry save
for each addition operation are provided to 4 bit full look
ahead carry adder 124. As both adders 116 and 118 are
generating partial sum and carry save terms during each
clock cycle, two sets of least significant partial sum and
carry save bits are provided to adder 124. This gives a total 
of 4 least significant bits of partial sum data and two bits of 
carry save data. This data is combined in adder 124 with the 
modulo 4 carry bit generated by each recoder stage in Booth
recoder 110 which is transferred by means of data bus 142. 
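
The recoding and factor selection described above can be sketched in a few lines of Python. This is a minimal illustration, not the circuit of FIG. 4: the digit table is the textbook radix-4 Booth table and is assumed to correspond to the truth table of recoder 200, and the names `booth_digits` and `select_factor` are hypothetical. The multiplier is padded with one zero bit below its least significant bit, so that field j covers padded bits 2j, 2j+1, and 2j+2 and adjacent fields overlap by one bit.

```python
# Textbook radix-4 (modulo 4) Booth digit table, keyed by the three bits of a
# field as (high, middle, low). Assumed, not copied from FIG. 4.
BOOTH_DIGIT = {
    (0, 0, 0): 0,
    (0, 0, 1): 1,
    (0, 1, 0): 1,
    (0, 1, 1): 2,
    (1, 0, 0): -2,
    (1, 0, 1): -1,
    (1, 1, 0): -1,
    (1, 1, 1): 0,
}


def booth_digits(b, nbits):
    """Recode an unsigned, nbits-wide multiplier (nbits even) into Booth digits.

    Digits are returned least significant first, each in {-2, -1, 0, 1, 2}.
    One extra digit is produced so the top field sees leading zeros.
    """
    padded = b << 1  # implicit zero bit below the least significant bit
    digits = []
    for j in range(nbits // 2 + 1):
        field = (padded >> (2 * j)) & 0b111
        key = ((field >> 2) & 1, (field >> 1) & 1, field & 1)
        digits.append(BOOTH_DIGIT[key])
    return digits


def select_factor(a, digit):
    """Model of the select multiplexer: choose 0, A, 2A, -A or -2A."""
    return {0: 0, 1: a, 2: a << 1, -1: -a, -2: -(a << 1)}[digit]
```

Summing `digit * 4**j` over the returned digits gives back the original multiplier, which is what allows the selected factors of A to be accumulated into the product.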

As noted, each clock cycle produces 4 bits of final product 
data after propagation of the operand A factors through 
PS/CS adders 116 and 118. These 4 bits of the product are
obtained by combining the two sets of 2 partial sum bits and 
1 carry save bit produced by the adders. Adder 124 adds the 
4 least significant bits of the partial sums produced by adders 
116 and 118 during a single clock cycle to the 2 partial carry 
bits and the 2 bits of modulo 4 carry data provided by
recoder 110 to produce 4 bits of the final product. Note that 
the two bits of modulo 4 carry data from Booth recoder 110 
are used by select multiplexer 114 to implement the two's 
complement subtraction function used in the recoding and
partial product accumulation stage. 
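
The way low-order product bits can be finalized while the rest of the sum stays in carry-save form can be sketched as follows. This is a simplified model under stated assumptions, not the patent's datapath: it processes one Booth digit (2 product bits) per step rather than two per clock, it uses Python's signed integers in place of the two's complement correction carried by the modulo 4 carry bits, and the names `csa` and `booth_multiply` are hypothetical.

```python
def csa(x, y, z):
    """Bitwise 3:2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def booth_multiply(a, b, nbits):
    """Multiply unsigned a and b (b is nbits wide, nbits even).

    Each step adds one Booth-selected factor of a in carry-save form, then
    resolves only the two lowest bits, as a small look ahead adder would.
    """
    ps = cs = 0   # partial sum and carry save registers
    carry = 0     # carry held by the low-bit adder between steps
    low = 0       # finished low-order product bits
    padded = b << 1
    steps = nbits // 2 + 1  # extra step so b is treated as unsigned
    for j in range(steps):
        f = (padded >> (2 * j)) & 0b111
        digit = -2 * (f >> 2) + ((f >> 1) & 1) + (f & 1)  # radix-4 Booth digit
        s, c = csa(ps, cs, digit * a)
        t = (s & 3) + (c & 3) + carry      # only 2 bit positions propagate
        low |= (t & 3) << (2 * j)
        carry = t >> 2
        ps, cs = s >> 2, c >> 2            # shift by 2 for the next digit
    high = ps + cs + carry                 # final carry-propagating addition
    return low + (high << (2 * steps))
```

For example, `booth_multiply(3, 3, 2)` walks through the digits -1 and +1 and returns 9. The only full-width carry propagation is the single addition at the end, which mirrors the final addition of the carry save data described in the text.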

Each 4 bit group of final product data produced by adder 
124 is shifted into 32 bit shift 4 register 126, which is 
controlled by clock P signal 125. Register 126 is used to 
combine the 4 bit groups of final product data into a 32 bit
segment of final product data. This operation is performed in 
order to reduce the circuitry needed for shifting the product 
terms into the accumulator used to form the final product. It 
also increases the speed with which the final product is
formed as compared to typical multiplier designs. 
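
The grouping performed by register 126 amounts to packing eight 4 bit groups into one 32 bit word. A minimal sketch, assuming the groups arrive least significant first (the function name `pack_groups` is hypothetical):

```python
def pack_groups(groups, group_bits=4):
    """Pack small bit groups, least significant first, into one word."""
    mask = (1 << group_bits) - 1
    word = 0
    for i, g in enumerate(groups):
        word |= (g & mask) << (group_bits * i)
    return word
```

With eight groups of 4 bits, the result is one 32 bit segment, so the accumulator needs to accept a new word only once every eight cycles.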

As each 32 bit group of final product data is completed, 
it is shifted out of register 126 to accumulator multiplexer 
128. The contents of accumulator multiplexer 128 are then
dumped into 256 bit accumulator 130, which represents the
lower half of a 512 bit accumulator that will ultimately
contain the final 512 bit product term resulting from the
calculation carried out by the multiplier. Clock AL signal
131 is used to load accumulator 130 with the 32 bit sections
of the final product obtained from 32 bit shift register 126 by
way of accumulator multiplexer 128.

FIG. 5 is a diagram showing the connections between the 
adder arrays and the partial sum/carry save registers of the 
multiplication unit of the present invention. The figure
shows the data flow between the one bit carry-save adders of
adder array PS/CS 0 116, the one bit carry-save adders of
adder array PS/CS 1 118, CS register 120, and PS register
122. As shown in FIG. 5, each of adder arrays 116 and 118
is composed of a group of one bit carry-save adders 150.
PS register 122 and CS register 120 are composed of a group
of individual registers 152. It is noted that FIG. 5 shows only
a portion of the full set of adders 150 and registers 152
contained in the multiplier.

Each one-bit adder 150 has inputs A, B, and CI (carry in 
bit) and outputs S (partial sum) and CO (carry out bit). The 
inputs to adder array 116 are the operand A factor corre- 
sponding to the recoded value of bits 0, 1, and 2 of the 
recoded section of operand B. This factor is shown as the
term A0 in the figure, where A0[n] represents the nth bit of
the term A0. The inputs to adder array 118 are the operand
A factor corresponding to the recoded value of bits 2, 3, and
4 of the recoded section of operand B. This factor is shown
as the term A1 in the figure, where A1[n] represents the nth
bit of the term A1.

The appropriate bits of the factor of operand A are input
as shown to the adders 150 of array 116. The other inputs to 
adders 150 of array 116 are the appropriate bits of PS 
register 122 and CS register 120. This implements a feed- 
back loop between the PS and CS registers and adder array 
116. This loop is used for the partial product accumulation 
function of the multiplier, and is indicated by data bus 154 
in FIG. 3. One-bit adders 150 in array 116 and array 118 are 
staggered with respect to each other, with the inputs to the 
nth adder in array 118 being provided by the outputs from 
the (n-2)th adder in array 116. This connection scheme
implements the 2 bit Booth recoding shift
required when performing a modulo 4 based calculation. 
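
One clock cycle of the two staggered arrays can be modelled with integers. The sketch below is illustrative only and assumes Python signed integers stand in for the 260 bit two's complement data paths; `csa` and `accumulate_cycle` are hypothetical names. Feeding the second array from outputs two positions lower is modelled as a right shift by 2 of the first array's sum and carry words.

```python
def csa(x, y, z):
    """Bitwise 3:2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def accumulate_cycle(ps, cs, a0, a1):
    """One cycle: array 0 adds factor a0, array 1 adds a1 two bits higher.

    Returns the new (ps, cs) register contents and the low-order
    (sum bits, carry bits) pairs that each array hands to the look ahead adder.
    """
    s0, c0 = csa(ps, cs, a0)
    low0 = (s0 & 3, c0 & 3)             # two lowest positions of array 0
    s1, c1 = csa(s0 >> 2, c0 >> 2, a1)  # staggered by 2 bit positions
    low1 = (s1 & 3, c1 & 3)             # two lowest positions of array 1
    return (s1 >> 2, c1 >> 2), low0, low1
```

The stagger means that a1 is effectively weighted by 4 relative to a0, so the cycle as a whole adds `a0 + 4 * a1` to the running total without any full-width carry propagation.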

As noted, the appropriate operand A factors are input to 
adder arrays 116 and 118. These factors are added to the 
results of the previous add operation, producing a new value 
for the partial sum and carry outputs. The least significant 
bits of the partial sum and carry out term produced by adder 
array 116 and adder array 118 (a total of 4 partial sum and 
2 carry out bits) each cycle are provided to carry look ahead 
adder 124 for combination into the 4 bit sections of the final 
product term. The remaining partial sum outputs of adders 
150 contained in adder array 118 provide the contents of PS 
register 122, while the remaining carry save outputs of 
adders 150 provide the contents of CS register 120. It is 
these terms which are provided to adder arrays 116 and 118 
during the next cycle by means of the feedback connection 
between the registers and adder arrays. 

After all of operand B has been recoded, the appropriate 
factors of operand A have been accumulated in adders 116 
and 118, and the partial sum and carry save data for each 
cycle has been transferred to adder 124, registers 120 and 
122 contain the most significant bits of the carry save 
operations performed on the operand A factors. CS register 
120 is 260 bits in size and is clocked by clock CS signal 121, 
while 260 bit PS register 122 is clocked by clock PS signal 
123. The contents of CS register 120 and PS register 122 are 
used to implement the final addition operation which pro- 
duces the upper 256 bits of the final product. 

When all of operand B has been recoded, accumulator 130 
contains the lower 256 bits of the final product. The remain- 
ing bits of the final product are obtained by adding the 
contents of 260 bit CS register 120 to the contents of 260 bit 
PS register 122. This addition is performed by 32 bit carry 
look ahead adder 132. As each 32 bit wide set of data from 
registers 120 and 122 is added by adder 132 to produce a 32 
bit group of the most significant bits of the final product, it 
is loaded into 256 bit accumulator 134 which represents the 
upper half of the 512 bit accumulator which will ultimately 
contain the final 512 bit product term. Clock AH signal 135 
is used to load accumulator 134 with the 32 bit sections of 
the final product obtained from adder 132. When accumu- 
lator 134 has been filled, both the upper and lower 256 bit 
sections of the final product are complete. 
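
The final addition of the CS and PS register contents in 32 bit sections can be sketched as a chunked addition with a carry held between cycles. This is a minimal model under assumptions: both inputs are treated as nonnegative values of at most `width` bits, and the name `chunked_add` and its parameters are hypothetical.

```python
def chunked_add(ps, cs, width=256, chunk=32):
    """Add two width-bit values chunk bits at a time, lowest chunk first.

    Returns (sum modulo 2**width, final carry out). The carry saved between
    chunks plays the role of a carry register on the look ahead adder.
    """
    mask = (1 << chunk) - 1
    carry = 0
    result = 0
    for i in range(0, width, chunk):
        t = ((ps >> i) & mask) + ((cs >> i) & mask) + carry
        result |= (t & mask) << i   # one 32 bit section of the upper half
        carry = t >> chunk          # saved for the next cycle
    return result, carry
```

With the default parameters the loop runs eight times, one 32 bit section per cycle, matching the eight loads needed to fill a 256 bit accumulator.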

The lower 256 bits of the final product are clocked out of 
accumulator 130 in 32 bit groups under the control of clock 
AL signal 131 and placed onto data bus 136. While the lower 
256 bits are being placed onto the data bus, the upper 256
bits are being clocked out of accumulator 134 by clock AH 
signal 135 and placed in 32 bit groups into accumulator 
multiplexer 128. The 32 bit groups of the upper 256 bits are
then passed to accumulator 130 as the lower bit groups are
clocked out of that register. By the time the 256 lower bits
of the final product have been clocked out of accumulator
130 and placed on bus 136, accumulator 130 has been
refilled by the 32 bit groups of the upper 256 bits of the
product formerly held in accumulator 134. The upper 256
bits are then clocked out of accumulator 130 and placed on
data bus 136. In this way, all 512 bits of the final product are
placed onto data bus 136 in 32 bit groups.

The multiplier of the present invention can be divided into
three functional modules: 1) operand loading module, 2)
Booth partial product calculation and accumulation module,
and 3) accumulator shift function module which builds the
final 512 bit product. As indicated in FIG. 3, each of the
three stages is clocked independently of the others.

Clock signals 105, 107, and 109 are used to load the
operands into shift registers 104, 106, and 108. Clock signals
121, 123, and 125 are used to calculate the Booth partial
products and assemble the segments of the product term into
the 256 bit upper and lower portions of the final product
which are stored in registers 134 and 130. Clock signals 135
and 131 are then used to control the assembly of the final
product term from the contents of registers 134 and 130.

As mentioned, many applications, such as cryptography,
require the calculation of the square of the sum of two
operands. This operation can also be used as part of an
exponentiation function, which can be decomposed into
operations which include squaring operations. In order to
increase the speed with which squaring operations can be
implemented, the present invention uses a control signal 140
(see FIG. 3) input to multiplexer 114 to cause the multiplier
to calculate the intermediate term 2*A*B used in the
squaring operation, since (A+B)^2 = A^2 + 2*A*B + B^2.

The control signal is provided by an external processor
which interprets the multiplication and squaring operation
commands input by a user and controls the operation of the
multiplier accordingly. The 2*A*B term is calculated by
performing a 1 bit shift of the term A (to form 2*A) using
circuitry contained in select multiplexer 114. This factor of
2*A is then multiplied by operand B by means of the Booth
recoding and partial product accumulation stage. The
additional circuitry required in multiplexer 114 to execute the 1
bit shift of operand A is smaller and less expensive to
implement than modifying accumulator 134 to execute a
shift of the entire product after completion of the
multiplication operation. Thus, the present invention provides a
more compact and less expensive means for performing
squaring operations than is available in currently used Booth
multipliers.

FIG. 6 is a more detailed block diagram of multiplier unit
100 of the 256 bit by 256 bit Booth Multiplier of FIG. 3. It
is noted that reference numbers common to both FIG. 3 and
FIG. 6 refer to the same elements. In addition to the elements
of FIG. 3, FIG. 6 shows registers 160, 162, and 164 which
are used to store and appropriately weight the carries of
carry look ahead adders 124, 132, and 168, respectively.
FIG. 6 also shows multiplexers 166 which are used to
implement more complex loading and dumping operations
for accumulators 130 and 134. Carry look ahead adder 168
is a 32 bit adder which is used to add the product of the
multiplication operation to an existing accumulator value.

Registers 160, 162, and 164 are used to accommodate an
overflow carry from the look ahead adders. For example, if
the look ahead adder has a carry in the fifth bit location,
since the present four bits of the sum are to be shifted out,
the fifth bit becomes the least significant bit for the next
cycle. Therefore, it is input as the carry in bit for the adder.

Multiplexers 166 are used to implement operations such
as dumping the lower half of the final product term to the
processor data bus and shifting the contents of upper half
accumulator 134 to lower half accumulator 130.
Multiplexers 166 can also be used to load the entire accumulator
(accumulator sections 134 and 130) with a 512 bit value
obtained from the processor data bus, to clear accumulators
130 and 134, to dump the contents of the entire accumulator
to the processor data bus, or to load data from the processor
data bus and add that value to the contents of upper half
accumulator 134.

It is noted that carry look ahead adder 132 performs two
functions during the Booth recoding and accumulation
operations. In the partial product accumulation phase, the 4
bit sections of the product term are pieced together in
register 126 until they form a 32 bit word. The equally
weighted 32 bit value in lower accumulator 130 is added to
the value in register 126 and shifted into accumulator 130.
During the PS and CS register addition phase which forms
the upper half of the product term, adder 132 is switched to
add the contents of the PS and CS registers, 32 bits per cycle.
This 32 bit value becomes one input to adder 168, which
adds the equally weighted 32 bit value in upper accumulator
134 to the sum of the PS and CS registers. This new sum is
then shifted into upper accumulator 134. These steps enable
the multiplier to perform the operation A*B+C, where A and
B are multiplication operands and C is the contents of the
512 bit accumulator at the beginning of a new multiply
cycle. The ability to perform the operation A*B+C means
that the shifting function of the accumulator can be used to
form the square of the sum of the first and second operands
from the accumulated Booth partial products.

The Booth recoding method can be performed on signed
or unsigned numbers depending upon how the most
significant bits of the operands are manipulated. Operand A
becomes an unsigned value by including most significant
bits having a value of zero. This is why the Booth adder data
paths are 260 bits wide instead of 256 bits wide for 256 bit
sized operands. Operand B becomes an unsigned value
when an extra recode cycle is performed and leading zeros
are included in the final recode. This offsets the significance
of the product by four bits. This four bit offset can be
accommodated by appropriate sequencer retiming and
offsetting the data flow.

The terms and expressions which have been employed
herein are used as terms of description and not of limitation,
and there is no intention in the use of such terms and
expressions of excluding equivalents of the features shown
and described, or portions thereof, it being recognized that
various modifications are possible within the scope of the
invention claimed.

I claim:

1. A method of computing the square of the sum of a first
operand and a second operand utilizing an integrated circuit,
the method comprising:

loading the first operand into a first operand data storage
means formed as part of the integrated circuit;

loading the second operand into a second operand data
storage means formed as part of the integrated circuit; 

recoding the second operand utilizing recoding circuitry 
formed as part of the integrated circuit, thereby deter- 
mining the factors of the first operand to use in forming
Booth partial products of the first and second operands; 

forming the factor 2*A utilizing factoring circuitry
formed as part of the integrated circuit, where A is the 
first operand, for use in computing the square of the 
sum of the first and second operands;

forming the Booth partial products of the factor 2*A and
the second operand utilizing partial products circuitry 
formed as part of the integrated circuit, including 
partial products of 2*A*B, where B is the second 
operand, for use in computing the square of the sum of 
the first and second operands; 

accumulating the Booth partial products utilizing accu- 
mulator circuitry formed as part of the integrated 
circuit; and

forming the square of the sum of the first and second 
operands from the accumulated Booth partial products 
by performing add and shift operations utilizing add/ 
shift circuitry formed as part of the integrated circuit.

2. The method of claim 1, wherein the step of forming the
factor 2*A, where A is the first operand, further comprises:

shifting the first operand value by one bit to form the 
value of two times the first operand. 

3. A method of computing twice the product of a first 
operand and a second operand utilizing an integrated circuit, 
the method comprising:

loading the first operand into a first operand data storage
means formed as part of the integrated circuit; 

loading the second operand into a second operand data 
storage means formed as part of the integrated circuit; 

recoding the second operand utilizing recoding circuitry 
formed as part of the integrated circuit, thereby deter- 
mining factors of the first operand to use in forming 
Booth partial products of the first and second operands; 

forming a factor 2*A utilizing factoring circuitry formed
as part of the integrated circuit, where A is the first 
operand, for use in computing twice the product of the 
first and second operands; 

forming the Booth partial products of the factor 2*A and 
the second operand utilizing partial products circuitry 
formed as part of the integrated circuit, including 
partial products of 2*A*B, where B is the second 
operand for use in computing twice the product of the 
first and second operands; and 

accumulating the Booth partial products utilizing accu- 
mulator circuitry formed as part of the integrated 
circuit. 

4. The method of claim 3, wherein the step of forming the
factor 2*A, where A is the first operand, further comprises:

shifting the first operand value by one bit to form the 
value of two times the first operand. 

* * * * *